Packing several characters into one computer word is a simple and
natural way to compress the representation of a string and to speed up
its processing. Exploiting this idea, we propose an index for a packed
string, based on a sparse suffix tree [8] with appropriately defined suffix
links. Assuming, under the standard unit-cost RAM model, that a word
can store up to $\log_\sigma n$ characters ($\sigma$ being the alphabet size), our
index takes $O(n/\log_\sigma n)$ space, i.e. the same space as the packed string
itself. The resulting pattern matching algorithm runs in time
$O(m + r^2 + r \cdot occ)$, where $m$ is the length of the pattern, $r$ is the actual
number of characters stored in a word and $occ$ is the number of pattern
occurrences.

1 Introduction 

Many application areas, such as genomics or computer security, face a sharp
growth in the volume of available data. Even with the spectacular development
of hardware capacities, data size often remains a bottleneck for efficient
processing, which calls for new algorithmic solutions allowing both a compact
representation and efficient querying of data.

With this motivation, research in combinatorial pattern matching has recently
developed various sophisticated methods for the efficient compact representation
of sequence data; see the survey [9] and references therein. A basic goal of these
methods is to index a text (sequence) through a succinct data structure, i.e.
one taking $O(n \log \sigma)$ bits of memory (as opposed to $O(n \log n)$ bits for
classical indexes), where $n$ is the text length and $\sigma$ the alphabet size. Thus,
succinct data structures take asymptotically as much space as the text itself.
Still, these indexes can efficiently support queries on the text, primarily the
string matching operation. As a downside, many of these methods, while
mathematically elegant and highly nontrivial, are too complex to be used in
practice. However, some of them have given rise to practical implementations
and real-life applications (e.g. [ol ITT]).

One simple idea for saving memory when storing sequence data is to pack
several characters into one machine word. Under the standard unit-cost RAM
computation model, it is assumed that the machine word size $w$ is at least
$\log n$ bits, and a word can therefore store as many as
$\frac{\log n}{\log \sigma} = \log_\sigma n$ characters. Not only does this save memory, it
also allows one to speed up algorithms, as two tuples of characters, each stored
in one word, can then be compared at unit cost. For example, [2] recently
proposed a modification of the Knuth-Morris-Pratt (KMP) pattern matching
algorithm for packed strings that reports all $occ$ occurrences of a pattern of
size $m$ in $O(n/\log_\sigma n + m + occ)$ time. As in the regular KMP algorithm, the
text is not preprocessed, and the speed-up is achieved by designing a special
data structure representing the pattern.
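
To make the packing idea concrete, the following Python sketch packs blocks of
$r$ characters into integers and locates the first differing block with one
comparison per block. It is only an illustration under stated assumptions: the
helper names `pack` and `pack_blocks` and the fixed-width binary encoding are
hypothetical, and Python integers merely stand in for machine words.

```python
def pack(text, alphabet):
    """Pack a string into one integer, using a fixed number of bits per character."""
    bits = max(1, (len(alphabet) - 1).bit_length())  # bits needed per character
    code = {c: i for i, c in enumerate(sorted(alphabet))}
    word = 0
    for c in text:
        word = (word << bits) | code[c]
    return word


def pack_blocks(text, r, alphabet):
    """Split text into blocks of r characters and pack each block into one integer."""
    return [pack(text[i:i + r], alphabet) for i in range(0, len(text), r)]


dna = "ACGT"
a = pack_blocks("ACGTACGA", 4, dna)
b = pack_blocks("ACGTACGT", 4, dna)

# One integer comparison per block replaces r character comparisons.
first_diff = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y), None)
print(first_diff)
```

With a 4-letter alphabet each character takes 2 bits, so a 64-bit word could
hold up to 32 characters in this encoding.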

In this paper, our goal is complementary: based on character packing,
we propose an index data structure for a text which supports pattern
matching queries in time linear in $m$ and at the same time uses
$O(n/\log_\sigma n)$ space (i.e. $O(n \log \sigma)$ bits), and thus constitutes a
succinct index.

The central idea of the index is to partition the text into blocks of
$r$ characters and to construct a suffix tree which stores only those suffixes
that start at block boundaries. Such a suffix tree, called an evenly spaced
sparse suffix tree, was first studied in [8] (see also p]). The suffix tree we
use here differs from that of [5] in the definition of suffix links.

A sparse suffix tree allows one to easily search for occurrences of the pattern
that start at block boundaries. Therefore, the pattern matching procedure splits
into two parts: locating occurrences of the pattern suffixes $P[k+1..m]$, for
$k = 0..r-1$, at block boundaries, and then selecting among them those locations
which are preceded by the corresponding pattern prefix $P[1..k]$. To solve the
second task, we use another data structure: the compacted trie of reversed
blocks, augmented with additional arrays assigned to its nodes. Selecting all
positions corresponding to a given suffix amounts to traversing the trie and
recomputing the interval of lexicographically ordered suffixes (see details in
Section 5). This is done using a technique inspired by fractional cascading [10],
which is also closely related to wavelet trees [5], a popular technique in text
compression and indexing (see [3]). As a result, we obtain a pattern matching
algorithm working in time $O(m + r^2 + r \cdot occ)$ while using $O(n/\log_\sigma n)$
space for storing both the text in packed form and the working data structures.

As far as related work is concerned, similar ideas appear in papers [3j [3 E],
although the idea of "character packing" is somewhat implicit there. Compared
to those papers, our approach differs in several respects. First, we use
a sparse suffix tree over the alphabet of characters, rather than a suffix tree
over the meta-alphabet of $r$-tuples of characters. Instead of searching for $r$
suffixes of the pattern independently, we locate them in a single traversal of
the suffix tree, using appropriately defined suffix links. Second, we do not
make use of external algorithms for orthogonal range queries (see [5J [7]), but
instead use specially designed algorithms on "classic" data structures
(compacted trie). Moreover, we restrict the use of the RAM model to manipulating
packed strings (i.e. to unit-cost operations on several letters packed into one
machine word) and indexing packed strings following the Four-Russians idea.
However, we do not use special data structures (such as the Geometric
Burrows-Wheeler transform [3]) involving numeric computations. Overall, we
obtain a pattern matching algorithm that is fully linear with respect to the
pattern length.

Let $\Sigma$ denote an alphabet, i.e. a set of letters or characters, of cardinality
$\sigma$. We assume a lexicographic order $<$ on $\Sigma$, naturally extended to the set
of all strings over $\Sigma$. Letters in a string are numbered from 1.

2 Evenly spaced sparse suffix tree 

We consider evenly spaced sparse suffix trees as defined in [8]. Consider a
string $T[1..n]$. Let $Suf_r$ be the set of suffixes
$\{T[rj+1..] \mid j = 0, 1, \ldots, \frac{n}{r} - 1\}$ (assume for simplicity that $n$ is
a multiple of $r$). The indexes $j$ will be called ordinals. An ordinal $j$
identifies the boundary between positions $rj$ and $rj+1$ in $T$, and the
corresponding suffix $T[rj+1..]$. An $r$-spaced suffix tree of $T$, denoted $ST_r$,
is a compacted trie for the set $Suf_r$. For $r = 1$, the $r$-spaced suffix tree is
the usual suffix tree. As in regular suffix trees, edges of an $r$-spaced suffix
tree are labeled by substrings $T[i..j]$ of $T$, each represented by a pair
$(i, j)$. We define explicit and implicit nodes of $ST_r$ in the same way as for
regular suffix trees. As in the regular suffix tree, an implicit node is
specified by a pair $(v, \ell)$, where $v$ is the closest explicit ancestor node
and $\ell$ is the offset with respect to $v$. Note that, by definition of the tree,
the labels of the outgoing edges of any explicit node have different first
letters.

Assuming that the last letter of $T$ is unique, $ST_r$ has $\frac{n}{r}$ leaves and
hence no more than $\frac{n}{r}$ explicit internal nodes. Therefore, $ST_r$ takes
$O(\frac{n}{r})$ space.

By default, a node may refer to either an explicit or an implicit node. A
string $\alpha$ is represented in $ST_r$ if $\alpha$ is a prefix of one of the suffixes of
$Suf_r$, i.e. if $\alpha$ is a substring of $T$ starting at a position $rj+1$ for
some $j$. In this case, $\alpha$ is the label of some node $v$ of $ST_r$, and we say
that $\alpha$ is represented by $v$, and $|\alpha|$ is the string depth of $v$.

Similarly to $r$-spaced suffix trees, we define an $r$-spaced suffix array.
Consider the lexicographic order on the suffixes of $Suf_r$ and define
$SA_r[i] = j$ iff $i$ is the rank of $T[rj+1..]$ in the lexicographic order on
$Suf_r$. Since $SA_r$ is a permutation of the ordinals
$\{0, 1, \ldots, \frac{n}{r} - 1\}$, there is an inverse mapping, denoted $SA_r^{-1}$.
Thus, for the suffix of $T$ starting at position $rj+1$, its rank in the
lexicographic order on $Suf_r$ equals $SA_r^{-1}[j]$.

Note that each leaf of the tree $ST_r$ represents some suffix of $Suf_r$, and we
call the rank of a leaf $v$ the rank of the suffix represented by $v$ in the
lexicographic order on $Suf_r$. The rank of $v$ is thus equal to $SA_r^{-1}[j]$,
where $T[rj+1..]$ is the suffix of $T$ represented by $v$.
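
The definition of the $r$-spaced suffix array and its inverse can be sketched
in Python as follows. This is a naive illustration that sorts the suffixes
directly (it is not the linear-time construction of Appendix B); the function
name `sparse_suffix_array` is hypothetical, and ranks are 0-based here.

```python
def sparse_suffix_array(T, r):
    """Return (SA_r, inverse of SA_r) for the suffixes starting at block boundaries."""
    assert len(T) % r == 0, "n is assumed to be a multiple of r"
    ordinals = range(len(T) // r)
    # Python strings are 0-based: the paper's suffix T[rj+1..] is T[r*j:] here.
    sa = sorted(ordinals, key=lambda j: T[r * j:])
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    return sa, inv


# Suffixes at block boundaries of "baabaab$" for r = 2:
#   j=0: "baabaab$", j=1: "abaab$", j=2: "aab$", j=3: "b$"
sa, inv = sparse_suffix_array("baabaab$", 2)
print(sa, inv)
```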

If the children of each internal node of $ST_r$ are ordered by the lexicographic
order of the labels of the corresponding edges, then the leaves of $ST_r$ (as
they occur in a depth-first traversal) are ordered by their ranks. For a node
$v$, we define $MinRank(v)$ and $MaxRank(v)$ to be respectively the minimal and
the maximal rank of the leaves in the subtree of $ST_r$ rooted at $v$. The ranks
of all leaves of the subtree rooted at $v$ form the rank interval
$[MinRank(v), MaxRank(v)]$. If $\alpha$ is the word corresponding to $v$, then the
ranks of the suffixes of $Suf_r$ starting with $\alpha$ are specified by the interval
$[MinRank(v), MaxRank(v)]$.

We assume that for each explicit node $v$ of $ST_r$, $MinRank(v)$ and
$MaxRank(v)$, as well as its string depth $d(v)$, can be recovered in constant
time. This can be trivially achieved by post-processing the tree and storing
this information explicitly.
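
The rank interval of a string can be illustrated without building the tree at
all. The sketch below sorts the sparse suffixes and uses binary search in place
of $MinRank$ and $MaxRank$; it is an assumption-laden stand-in for the suffix
tree lookup, with a hypothetical function name `rank_interval` and 0-based
ranks.

```python
from bisect import bisect_left, bisect_right


def rank_interval(T, r, alpha):
    """Inclusive 0-based interval of ranks of sparse suffixes starting with alpha, or None."""
    suffixes = sorted(T[i:] for i in range(0, len(T), r))
    # Truncating sorted strings to a common length keeps them in sorted order,
    # so all suffixes with prefix alpha form one contiguous run.
    prefixes = [s[:len(alpha)] for s in suffixes]
    lo = bisect_left(prefixes, alpha)
    hi = bisect_right(prefixes, alpha)
    return (lo, hi - 1) if lo < hi else None


# Sorted sparse suffixes of "baabaab$" for r = 2: "aab$", "abaab$", "b$", "baabaab$"
print(rank_interval("baabaab$", 2, "a"))
print(rank_interval("baabaab$", 2, "bb"))
```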

We extend the $r$-spaced suffix tree $ST_r$ with suffix links defined differently
than in [8]. For each explicit node $v$ representing a string $\alpha$, the suffix
link $s(v)$ maps $v$ to the (not necessarily explicit) node labeled with the
longest proper suffix $\alpha[i+1..]$ of $\alpha$ represented in the tree.

The offset $i$ will be called the type of the suffix link. It follows easily that
$1 \le i \le r$. For each explicit node $v$ of $ST_r$, we store the target node
$s(v)$ together with the type of the suffix link.

Given a string $T$, the $r$-spaced suffix tree $ST_r$, including the functions $s$,
$SA_r$ and $SA_r^{-1}$, can be constructed in time $O(n)$ and space $O(\frac{n}{r})$.
Due to space limitations, the construction is described in Appendix B.
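
The definition of the suffix link and its type can be checked on small inputs
with a brute-force sketch. The code below tests directly which suffixes of
$\alpha$ occur at block boundaries; it is not the construction of Appendix B,
and the names `represented` and `suffix_link` are hypothetical.

```python
def represented(T, r, s):
    """True if s occurs in T starting at a block boundary (0-based offsets 0, r, 2r, ...)."""
    return any(T.startswith(s, i) for i in range(0, len(T), r))


def suffix_link(T, r, alpha):
    """Return (type i, target string): the longest proper suffix of alpha represented in ST_r."""
    for i in range(1, len(alpha) + 1):
        # alpha[i:] drops the first i letters, i.e. it is alpha[i+1..] in 1-based notation
        if represented(T, r, alpha[i:]):
            return i, alpha[i:]


T = "baabaab$"
print(suffix_link(T, 2, "baab"))   # "aab" starts at a block boundary, so the type is 1
print(suffix_link(T, 2, "aab$"))   # "ab$" does not, "b$" does, so the type is 2
```

The empty string is always represented (by the root), so the loop always
returns for a non-empty $\alpha$.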



3 RightSearch 

Consider a pattern $P[1..m]$. Using the sparse suffix tree, we locate all
occurrences of the pattern suffixes $P[1..], P[2..], \ldots, P[r..]$ at block
boundaries using a procedure RightSearch that we describe in this section.
Once an occurrence of $P[k+1..]$ is found, the rank interval of $P[k+1..]$ in
$SA_r$ is submitted to another procedure, LeftSearch, which selects from it the
positions that are preceded by occurrences of $P[1..k]$, thus locating the whole
pattern. We will say in this case that $P$ occurs in $T$ with $k$-offset.
LeftSearch will be described in Sections 4 and 5.

RightSearch proceeds by navigating through $ST_r$, trying to locate all nodes
representing $P[1..], P[2..], \ldots, P[r..]$. Starting at the root with $P[1..]$,
RightSearch follows the current suffix $P[k+1..]$ down the tree as long as
possible.

When following an edge in the tree, its label $T[i..j]$ is divided into blocks of
$r$ letters and each block, except possibly the last incomplete block, is compared
by a single operation. The last incomplete block is compared letter by letter.

The pseudo-code of RightSearch is given in Algorithm 1 in Appendix A.
Assume that RightSearch arrives at some (generally implicit) node $(v, \ell)$
upon reaching the end of $P[k+1..m]$ (line 14 of Algorithm 1). Then the algorithm
retrieves the rank interval $[MinRank(v'), MaxRank(v')]$, where $v'$ is the
closest explicit descendant node, which specifies all the occurrences of
$P[k+1..]$ at block boundaries. After that, the traversal jumps to $s(v)$ and
proceeds with the prefix $P[k+i+1, m-i+1]$ of the current suffix $P[k+i+1..]$,
where $i$ is the type of the suffix link $s(v)$ (lines 20-22).



Assume now that RightSearch reaches a mismatch while processing the current
suffix $P[k+1..]$ (line 8). Assume that the mismatch occurred when visiting
a node $(v, \ell)$ and processing a prefix $P[k+1..p]$ of $P[k+1..]$. Similarly to
the previous case, the algorithm jumps to $s(v)$ and proceeds with the prefix
$P[k+i+1, p-i+1]$ of the new current suffix $P[k+i+1..]$, where $i$ is the type
of the suffix link $s(v)$.

Importantly, the described procedure does not miss any occurrences: 

Lemma 1. Algorithm 1 correctly identifies all suffixes $P[k+1..]$,
$0 \le k \le r-1$, occurring at block boundaries of $T$.

Proof. It is easy to see by induction that once a suffix $P[k+1..]$ is found
(line 14 of Algorithm 1), it is represented in the tree and therefore occurs
starting at a block boundary.

A key point is that the procedure does not miss any such suffix. This is
due to the definition of suffix links: when following a suffix link (lines
20-22), the algorithm switches from processing the suffix $P[k+1..]$ to the
suffix $P[k+i+1..]$, where $i$ is the type of the suffix link. Since the suffix
link points to the longest proper suffix represented in the tree, no suffix
$P[k+i'+1..]$ with $1 \le i' < i$ can be represented in the tree. □

Let us now turn to the analysis of the running time of RightSearch. The
algorithm navigates over the suffix tree $ST_r$ by following edges downwards,
either by chunks of $r$ letters or letter by letter, and by following suffix
links. We analyse separately the traversal of two types of edges: completely
traversed edges (hereafter traversed edges), and incompletely traversed edges
(hereafter dead-end edges), where the traversal stops either due to a mismatch
or because a suffix has been found.

The number of dead-end edges is at most $r$, as each of them terminates the
processing of some suffix $P[k+1..]$. On each such edge, the algorithm makes no
more than $m/r$ block comparisons and $r$ letter-by-letter comparisons.
Therefore, the total time spent on dead-end edges is
$O(r \cdot (m/r + r)) = O(m + r^2)$.

The number of comparisons made along the traversed edges is bounded by $m$, as
these comparisons involve different portions of the pattern. In other words,
the sequence of these comparisons can be associated with moving a pointer in
the pattern left to right, either by blocks of $r$ letters or by single
letters. The total time spent on these comparisons is thus $O(m)$.

Theorem 1. RightSearch computes the rank intervals of all suffixes
$P[k+1..]$, $0 \le k \le r-1$, occurring at block boundaries of $T$ in time
$O(m + r^2)$.



4 Compacted trie 

RightSearch, described in the previous section, computes the rank intervals of
all pattern suffixes $P[k+1..]$ occurring at block boundaries of $T$. For each
such suffix, procedure LeftSearch is called, which selects, from this rank
interval of $SA_r$, those boundary positions which are preceded by $P[1..k]$.
LeftSearch is based on another data structure, the compacted trie of reversed
blocks, which we describe in this section.

The data structure, denoted $CT_r$, is based on a compacted trie storing all the
blocks of $T$ written in reverse order, i.e. all the strings
$T_j = (T[r(j-1)+1..rj])^R$ for $j = 1, \ldots, \frac{n}{r}$. Since there are
$\frac{n}{r}$ blocks overall, $CT_r$ has no more than $\frac{n}{r}$ leaves and therefore
takes $O(\frac{n}{r})$ space.

For each node $v$ of $CT_r$, let $l(v)$ be the string represented by $v$ in
$CT_r$, and let $d(v) = |l(v)|$ be the string depth of $v$.

$CT_r$ is used by LeftSearch in a natural way: in order to find occurrences
of $P[1..k]$ ending at block boundaries, we look up its reverse
$P[k]P[k-1] \ldots P[1]$ in $CT_r$ by following the corresponding branch. However,
efficiently selecting those occurrences which belong to the rank interval
computed by RightSearch is a non-trivial task which can be reduced to a kind of
orthogonal range query problem (see [31 [Z]). We propose a more efficient direct
solution inspired by the technique used in [10] for the problem of 3-D
dominance reporting. The technique is also somewhat similar to wavelet trees
[5], which have become a popular tool in text compression and indexing (see [9]).

For each node $v$ of $CT_r$, let $Ord_v = \langle j_1, j_2, \ldots, j_{N(v)} \rangle$ be the
ordered set of all ordinals $j$ such that $l(v)$ is a prefix of $T_j$, that is,
$(l(v))^R$ occurs in $T$ ending at position $rj$. The order of the ordinals
$j_1, j_2, \ldots, j_{N(v)}$ is defined by their rank in the lexicographic order on
the suffixes $T[rj+1..]$ (see Section 2). In other words,

$$SA_r^{-1}[j_1] < SA_r^{-1}[j_2] < \cdots < SA_r^{-1}[j_{N(v)}].$$

$Ord_v$ is not stored explicitly for internal nodes of $CT_r$ but is stored
explicitly for the leaves. For each leaf $v$ of $CT_r$, $Ord_v$ is stored as an
array of $N(v)$ entries containing $j_1, j_2, \ldots, j_{N(v)}$ in order. For each
internal node $v$, we store two arrays, $p_v$ and $c_v$, that we describe now.

The array $p_v$ contains letters stored in packed form, i.e. each entry of $p_v$
is a machine word that stores some fixed number of letters to be defined later.
The letter sequence stored in $p_v$ is defined as follows. If
$Ord_v = \langle j_1, \ldots, j_{N(v)} \rangle$, then $p_v$ contains the letters
$T[rj_1 - d(v)], T[rj_2 - d(v)], \ldots, T[rj_{N(v)} - d(v)]$ in this order.

The array $p_v$ provides the information necessary for choosing the appropriate
child when navigating down through $CT_r$. If $u_1, u_2, \ldots, u_t$ are the
children of $v$, then $Ord_v = Ord_{u_1} \uplus Ord_{u_2} \uplus \cdots \uplus Ord_{u_t}$.
Consider $j \in Ord_v$. Then $j \in Ord_{u_i}$ iff $T[rj - d(v)]$ is the first
letter of the label of the edge $(v, u_i)$.
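
The sets $Ord_v$ and the letter sequences of $p_v$ can be computed naively for
a small text, as in the Python sketch below. Several things here are
assumptions made for the illustration: the function name `ord_and_p` is
hypothetical, a trie node is identified simply by its label $l(v)$, letters are
kept unpacked, and only the interior boundaries $1 \le j < n/r$ are considered
so that both the preceding block and the following suffix exist.

```python
def ord_and_p(T, r, label):
    """Return (Ord_v, letters of p_v) for the trie node v with l(v) == label.

    The paper's 1-based position T[x] corresponds to T[x - 1] in Python.
    """
    d = len(label)
    n_blocks = len(T) // r
    # j is in Ord_v iff the reversed block ending at position rj starts with label
    members = [j for j in range(1, n_blocks)
               if T[r * (j - 1):r * j][::-1].startswith(label)]
    # order the ordinals by the lexicographic rank of the suffix T[rj+1..]
    members.sort(key=lambda j: T[r * j:])
    # p_v holds T[rj - d(v)], the letter just left of the matched part (only if d < r)
    p = [T[r * j - d - 1] for j in members] if d < r else []
    return members, p


T = "abcbabcba$"
print(ord_and_p(T, 2, "b"))    # node with label "b"
print(ord_and_p(T, 2, "ba"))   # its child reached by letter "a"
```

In this example the ordinals of the node labeled "b" whose $p_v$ letter is "a"
are exactly the ordinals of the child labeled "ba", in the same relative order.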

We now define the size of $p_v$, which is determined by the number of letters
stored in one entry. Our basic assumption is that a machine word is at least
$\log n$ bits long, and can therefore hold at least
$\frac{\log n}{\log \sigma} = \log_\sigma n$ letters. Then, assuming $r \le \log_\sigma n$
ensures that a block of $r$ letters can be compared at unit cost, which is our
primary condition (see Introduction). Define the number of letters stored in
one entry of $p_v$ to be $r/2$. The size of $p_v$ is then $\frac{2N(v)}{r}$ machine
words.

The second array $c_v$ is a 2-dimensional array of integers. For each letter
$b \in \Sigma$ and each $j = 1, 2, \ldots, \frac{2N(v)}{r}$, $c_v[b, j]$ stores the number
of occurrences of letter $b$ in the first $j$ machine words of $p_v$. It takes
$\frac{2\sigma N(v)}{r}$ memory words to store $c_v$.

The number of letters inside each entry of $p_v$ is preprocessed separately,
following the Four-Russians idea. Formally, on top of the data structure $CT_r$
described above, we store a 3-dimensional array $C$ defined as follows. For each
possible instance $u$ of an entry of $p_v$, each letter $b \in \Sigma$ and each
$j = 1, 2, \ldots, r/2$, $C[u, b, j]$ stores the number of letters $b$ among the
first $j$ letters contained in $u$.

Lemma 2. The array $C$ takes $o(\frac{n}{r})$ space.

Proof. As each entry of $p_v$ contains $r/2$ letters, the number of possible
entries is $\sigma^{r/2}$. Therefore, the size of $C$ is $\sigma^{r/2+1}(r/2)$. As
$r \le \log_\sigma n$, the size of $C$ is $O(n^{1/2} \log n) = o(\frac{n}{r})$. □
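
The interplay of the two counting tables can be sketched as follows. This is
an illustration only: words are modelled as tuples of $w$ letters rather than
packed integers, the names `build_tables` and `rank` are hypothetical, and the
table $C$ is built for every word length up to $w$ so that a shorter last word
is also covered.

```python
from itertools import product


def build_tables(letters, alphabet, w):
    """Split `letters` into words of w letters and build the counting tables.

    c[b][j]      = occurrences of b in the first j words (j = 0 .. number of words)
    C[(u, b)][j] = occurrences of b among the first j letters of the word u
    """
    words = [tuple(letters[i:i + w]) for i in range(0, len(letters), w)]
    c = {b: [0] for b in alphabet}
    for word in words:
        for b in alphabet:
            c[b].append(c[b][-1] + word.count(b))
    C = {}
    for length in range(1, w + 1):
        for u in product(alphabet, repeat=length):
            for b in alphabet:
                C[(u, b)] = [u[:j].count(b) for j in range(length + 1)]
    return words, c, C


def rank(words, c, C, w, b, x):
    """Occurrences of b among the first x letters, using two table lookups."""
    full, rest = divmod(x, w)
    count = c[b][full]                       # whole words, from c
    if rest:
        count += C[(words[full], b)][rest]   # partial word, from C
    return count


letters = "ccaab"
words, c, C = build_tables(letters, "abc", 2)
print(rank(words, c, C, 2, "a", 3))
```

Note that $C$ depends only on the alphabet and the word length, not on the
node, which is why a single copy can be shared by all nodes.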

The following lemma summarizes the space taken by all the data structures. 

Lemma 3. The data structure $CT_r$ and the array $C$ take $O(\frac{n}{r})$ space
altogether.

Proof. The compacted trie $CT_r$ has at most $\frac{n}{r}$ leaves and hence no more
than $\frac{n}{r}$ internal nodes. The ordinal sets stored at the leaves are
pairwise disjoint and together hold all the ordinals $j = 1, \ldots, \frac{n}{r}$;
therefore their total size is $\frac{n}{r}$.

Each letter of $T$ appears at most once in all the arrays $p_v$, which implies
that the overall number of letters in all $p_v$ is $O(n)$. Therefore, the overall
size of all $p_v$ is $O(\frac{n}{r})$. For the same reason, the overall size of the
arrays $c_v$ is $O(\frac{n}{r})$ too. The size of $C$ is $o(\frac{n}{r})$. The lemma
follows. □

In Appendix C we will show how to construct $CT_r$, including all the arrays
$p_v$, $c_v$ and $C$, in $O(n)$ time and $O(\frac{n}{r})$ space.



5 LeftSearch 

We now describe the algorithm LeftSearch. Recall that RightSearch locates all
suffixes $P[k+1..]$, $0 \le k \le r-1$, occurring at block boundaries.
For each such suffix $P[k+1..]$, RightSearch outputs the corresponding rank
interval $[LB_k, RB_k]$ in $SA_r$ such that $P[k+1..]$ occurs precisely at the
positions $\{rj+1 \mid j = SA_r[i], LB_k \le i \le RB_k\}$. The goal of LeftSearch is
to select from this set those positions which are preceded by the prefix
$P[1..k]$.

Let us fix some $k$, $1 \le k \le r-1$. (For $k = 0$, RightSearch finds the entire
occurrence of $P$ and therefore LeftSearch is not needed.) As already mentioned
in Section 4, the general idea of LeftSearch is intuitive: the algorithm
simply follows $P[k]P[k-1] \ldots P[1]$ in $CT_r$ starting from the root. If this
word is not represented in $CT_r$, then $P[1..k]$ does not have any occurrences
ending at block boundaries.

Assume that $v_0, v_1, \ldots, v_\ell$ are the nodes of $CT_r$ traversed when following
$P[k]P[k-1] \ldots P[1]$. Consider a node $v_i$ of string depth $d(v_i)$ and the
associated ordered set $Ord_{v_i}$. The following statement holds.

Lemma 4. The ordinals $j$ such that $P[k-d(v_i)+1..k]$ is a suffix of $T[..rj]$
and $P[k+1..]$ is a prefix of $T[rj+1..]$ form an interval of $Ord_{v_i}$.

Proof. $Ord_{v_i}$ is, by definition, the set of ordinals $j$ such that
$P[k-d(v_i)+1..k]$ is a suffix of $T[r(j-1)+1..rj]$. These ordinals $j$ are
ordered in $Ord_{v_i}$ according to the lexicographic order of the suffixes
$T[rj+1..]$ of $T$. Those suffixes which start with $P[k+1..]$ then form an
interval of $Ord_{v_i}$. □

Lemma 4 provides the key idea of LeftSearch: when following
$P[k]P[k-1] \ldots P[1]$ in $CT_r$, for each visited node $v_i$ maintain the interval
of $Ord_{v_i}$ which contains those ordinals $j$ of $Ord_{v_i}$ for which $P[k+1..]$
is a prefix of $T[rj+1..]$. Note that since the sets $Ord_{v_i}$ are not stored
explicitly for internal nodes, the algorithm actually manipulates the indexes
$1, \ldots, N(v_i)$ of the ordinals of $Ord_{v_i}$ rather than the ordinals
themselves. More precisely, when visiting a node $v_i$, the algorithm computes
the corresponding interval $[LB(v_i), RB(v_i)]$ of $[1..N(v_i)]$.

The traversal of $CT_r$ by the word $P[k]P[k-1] \ldots P[1]$ starts at the root
node $v_0$. The set $Ord_{v_0}$ is the set of all ordinals
$\{0, \ldots, \frac{n}{r} - 1\}$ ordered by the lexicographic order of the suffixes
$T[rj+1..]$, that is, $Ord_{v_0} = \langle SA_r[i] \mid i = 1..\frac{n}{r} \rangle$. The initial
interval $[LB(v_0), RB(v_0)]$ of $Ord_{v_0}$ should contain those ordinals $j$ for
which $P[k+1..]$ is a prefix of $T[rj+1..]$, which is precisely the rank interval
$[LB_k, RB_k]$ computed by RightSearch (see Algorithm 1).

Let us now focus on the key step of LeftSearch, which consists in updating
the current interval $[LB(v_i), RB(v_i)]$ when moving in $CT_r$ from the current
node $v_i$ to one of its children.

Let $v_i$ be an internal node of $CT_r$ reached by the word
$P[k] \ldots P[k-d(v_i)+1]$, and let $[LB(v_i), RB(v_i)]$ be the interval of
$Ord_{v_i}$ computed by the algorithm. Consider the child $v_{i+1}$ reached by
$P[k] \ldots P[k-d(v_i)+1] P[k-d(v_i)] \ldots P[k-d(v_{i+1})+1]$. Let $a$ be the first
letter of the label of the edge $(v_i, v_{i+1})$, i.e. $a = P[k-d(v_i)]$. The
following lemma shows how to compute the interval $[LB(v_{i+1}), RB(v_{i+1})]$
from the interval $[LB(v_i), RB(v_i)]$.



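Lemma 5. $LB(v_{i+1})$, $RB(v_{i+1})$ can be obtained from $LB(v_i)$, $RB(v_i)$ by
the following formulas:

$$LB(v_{i+1}) = c_{v_i}\left[a, \left\lfloor \frac{LB(v_i)}{r/2} \right\rfloor\right] + C\left[p_{v_i}\left[\left\lceil \frac{LB(v_i)}{r/2} \right\rceil\right], a, LB(v_i) \backslash (r/2)\right], \qquad (1)$$

$$RB(v_{i+1}) = c_{v_i}\left[a, \left\lfloor \frac{RB(v_i)}{r/2} \right\rfloor\right] + C\left[p_{v_i}\left[\left\lceil \frac{RB(v_i)}{r/2} \right\rceil\right], a, RB(v_i) \backslash (r/2)\right], \qquad (2)$$

where $\backslash$ denotes the remainder of integer division.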



Proof. Let us first analyse how $Ord_{v_{i+1}}$ is related to $Ord_{v_i}$. It is
easily seen that $Ord_{v_{i+1}} \subseteq Ord_{v_i}$ and that the order of the elements
of $Ord_{v_{i+1}}$ is the same as in $Ord_{v_i}$. Furthermore, $j \in Ord_{v_{i+1}}$ iff
$j \in Ord_{v_i}$ and $T[rj - d(v_i)] = P[k - d(v_i)]$. (Note that since there are
no branching nodes between $v_i$ and $v_{i+1}$, this means that
$T[rj - d(v_{i+1}) + 1..rj - d(v_i)] = P[k - d(v_{i+1}) + 1..k - d(v_i)]$
and then $T[rj - d(v_{i+1}) + 1..rj] = P[k - d(v_{i+1}) + 1..k]$.) Therefore,
computing the interval $[LB(v_{i+1}), RB(v_{i+1})]$ from $[LB(v_i), RB(v_i)]$ can be
done by counting the number of $a$'s among the letters stored in $p_{v_i}$
within the intervals of $p_{v_i}$ defined by the positions $LB(v_i)$, $RB(v_i)$.
These counts can be retrieved from the auxiliary arrays $c_{v_i}$ and $C$.

Recall from Section 4 that letters are stored in $p_{v_i}$ in packed form, $r/2$
letters per machine word. The number of $a$'s within several consecutive
machine words is provided by the array $c_{v_i}$, whereas the number of $a$'s
within a part of one machine word is provided by the array $C$. Formulas (1),
(2) follow. The computation is illustrated in Figure 1. □

Figure 1: Illustration of the key step of LeftSearch (Lemma 5). In this
example, $r = 4$ and each machine word of $p_v$ stores 2 letters.
$[LB(v_i), RB(v_i)] = [2, 7]$. The first letter of the label of the edge
$(v_i, v_{i+1})$ is $b$. Therefore, $LB(v_{i+1}) = 1$, as there is one letter $b$
among the first two letters of $p_{v_i}$. $RB(v_{i+1})$ is computed similarly. The
arrays $Ord_v$ are not stored for internal nodes and are shown here for
explanatory purposes only.
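
The interval update performed when descending to a child can be sketched in
Python as below. This is a hedged illustration rather than a transcription of
formulas (1), (2): it uses the usual wavelet-tree convention (count the letters
strictly before the left end and up to the right end), which may differ from
the formulas by boundary conventions, and it counts directly on an unpacked
sequence where the paper uses the tables $c_v$ and $C$. The name
`child_interval` is hypothetical.

```python
def child_interval(p, a, lb, rb):
    """Map the 1-based inclusive interval [lb, rb] of Ord_v to the child reached by letter a.

    p[x - 1] is the letter attached to the x-th ordinal of Ord_v.
    Returns the 1-based inclusive interval in the child's ordered set, or None if empty.
    """
    def count(x):
        # occurrences of a among the first x letters of p
        # (stands in for the constant-time lookup in c_v and C)
        return p[:x].count(a)

    new_lb = count(lb - 1) + 1
    new_rb = count(rb)
    return (new_lb, new_rb) if new_lb <= new_rb else None


# Letters attached to an ordered set of four ordinals, e.g. Ord_v = <4, 2, 3, 1>.
p = "ccaa"
print(child_interval(p, "a", 2, 3))   # only the 3rd ordinal has letter "a"
print(child_interval(p, "c", 3, 4))   # no "c" in positions 3..4, so the interval is empty
```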

Algorithm 2 in Appendix A shows the pseudo-code of LeftSearch. Based
on Lemma 5, LeftSearch recomputes, using formulas (1), (2), the current
interval $[LB(v_i), RB(v_i)]$ at each node $v_i$ along the traversal of the branch
of $CT_r$ defined by the word $P[k] \ldots P[1]$. If at some point the interval
$[LB(v_i), RB(v_i)]$ becomes empty, there is no occurrence of $P$ with
$k$-offset, and LeftSearch terminates. Once the terminal node $v_\ell$ is reached,¹
the algorithm has identified a subtree of $CT_r$ such that the leaves of this
subtree store the ordinals $j$ corresponding to the occurrences of $P[1..k]$
ending at block boundaries. For each such leaf $u$, the set $Ord_u$ of these
ordinals is explicitly stored in an array. However, as for internal nodes, for
each leaf $u$ we have to compute the interval $[LB(u), RB(u)]$ defining the
ordinals of interest. This is done in the same way as before, namely by
traversing down the branches of $CT_r$ and updating the interval using formulas
(1), (2). The only difference is that, starting from $v_\ell$, we need to explore
all branches of $CT_r$, rather than only the one branch determined by the word
$P[k] \ldots P[1]$.

Thus, the algorithm proceeds by exploring all the branches of the subtree
defined by $v_\ell$ and performing the computation of Lemma 5 for all the children
of each node, rather than for only one child as before. An obvious optimization
here is that once the current interval $[LB(v), RB(v)]$ becomes empty for some
node $v$, the algorithm stops exploring the subtree of $v$, as none of its leaves
can contain the desired ordinals. The traversal of the subtree of $v_\ell$ is done
by an auxiliary procedure Traverse shown in Algorithm 3 in Appendix A.

¹ The node reached after following $P[k] \ldots P[1]$ can be an implicit node of
$CT_r$, in which case the algorithm moves on to the closest explicit descendant
node (lines 16-18 in Algorithm 2).

We are left with the analysis of the running time of LeftSearch. Since the
computation of Lemma 5 is done in constant time, the traversal of
$v_0, v_1, \ldots, v_\ell$ is done in time $O(k)$, that is, $O(r)$. Starting from
$v_\ell$, the algorithm explores the corresponding subtree, but once the current
interval $[LB(v), RB(v)]$ becomes empty, the subtree of $v$ is pruned.

Let us call a node $v$ (internal or leaf) non-empty if the corresponding
interval $[LB(v), RB(v)]$ is non-empty. Observe that for a non-empty internal
node, at least one of its descendants is non-empty. This means that there are
no non-empty internal nodes outside the paths leading to non-empty leaves.
Processing every non-empty internal node requires $O(\sigma)$ time, which is the
time needed to examine its children. The whole traversal of the subtree of
$v_\ell$ thus takes time $O(\sigma r)$ per non-empty leaf.

Since every non-empty leaf defines at least one $k$-offset occurrence of $P$, the
total running time of LeftSearch is then $O(r + r \cdot occ_k)$, where $occ_k$ is
the number of resulting $k$-offset occurrences.

6 Resulting bound 

Theorem 2. Searching for all occurrences of $P$ in $T$ using algorithms
RightSearch and LeftSearch takes time $O(m + r^2 + r \cdot occ)$, where $occ$ is the
total number of output occurrences.

Proof. The time taken by RightSearch is $O(m + r^2)$. There are at most $r$ calls
to LeftSearch, which take time
$O(r \cdot r + r \sum_k occ_k) = O(r^2 + r \cdot occ)$. The theorem follows. □

Note for completeness that we have always assumed that the pattern length
$m$ is larger than $r$, so that the pattern must cross at least one block
boundary. In case $m < r$, all occurrences of $P$ located inside blocks can be
reported in time $O(m + occ)$ (see [3]).
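
The overall decomposition (a suffix found at a block boundary, then the prefix
checked to its left) can be validated against naive search with a brute-force
reference such as the one below. It deliberately ignores all the data
structures of the paper and does not achieve the bound of Theorem 2; the
function name `occurrences` is hypothetical and positions are 0-based.

```python
def occurrences(T, P, r):
    """All 0-based start positions of P in T, found via block boundaries (assumes len(P) >= r)."""
    found = set()
    for k in range(min(r, len(P))):            # k letters of P lie left of the boundary
        prefix, suffix = P[:k], P[k:]
        for boundary in range(0, len(T), r):
            # "RightSearch" part: the suffix P[k+1..] must start at a block boundary
            if T.startswith(suffix, boundary):
                start = boundary - k
                # "LeftSearch" part: the k letters before the boundary must spell P[1..k]
                if start >= 0 and T[start:boundary] == prefix:
                    found.add(start)
    return sorted(found)


T = "abcbabcba$"
print(occurrences(T, "bab", 2))
print(occurrences(T, "cb", 2))
```

Every occurrence of a pattern of length at least $r$ contains exactly one block
boundary among its first $r$ positions, which is why each occurrence is found
for exactly one value of $k$.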

7 Concluding remarks 

In this paper, we proposed a compact indexing scheme supporting linear-time
string matching. The guiding idea is the packing of several characters into one
machine word and the use of the sparse suffix tree based on partitioning the
text string into blocks of equal size $r$. The core of the algorithm is the
procedure RightSearch computing, in a single traversal of the sparse suffix
tree, all the suffixes $P[k+1..]$, $0 \le k < r$, in time $O(m + r^2)$. For each
such suffix, procedure LeftSearch is called, which selects those occurrences of
$P[k+1..]$ which are preceded by $P[1..k]$. All resulting occurrences are then
reported in time $O(m + r^2 + r \cdot occ)$.

One of our goals was to design a simple dedicated algorithm that does not
call for complex external subroutines, such as those supporting orthogonal
range queries. The obtained solution, however, remains somewhat complex: in
particular, the additional compacted trie data structure and the implementation
of LeftSearch represent a complex step. We believe that this could be further
improved, leading to a simpler algorithm possibly using only one data structure.
Such a simplification could possibly lead to getting rid of the $r$ factor in the
$r \cdot occ$ term of the complexity bound, thus yielding a solution that is fully
linear in both the pattern length and the number of pattern occurrences. This
constitutes a challenging problem for future research.
A Pseudocodes

Algorithm 1 RightSearch

 1: k ← 1
 2: p ← 1
 3: Vertex ← root
 4: VertexOffset ← 0
 5: while k < r do
 6:     while p < m do
 7:         follow down the current edge of $ST_r$ by comparing at once r characters
            P[p..p + r − 1] if possible, or one character P[p] otherwise
 8:         if mismatch occurred then
 9:             break the while-loop
10:         else
11:             update Vertex, VertexOffset, p
12:         end if
13:     end while
14:     if p = m then
15:         if VertexOffset ≠ 0 then
16:             Descendant ← closest explicit descendant for (Vertex, VertexOffset)
17:         end if
18:         LeftSearch(k, MinRank(Descendant), MaxRank(Descendant))
19:     end if
20:     p ← p − VertexOffset + 1
21:     (Vertex, VertexOffset) ← s(Vertex)
22:     k ← k + type of the suffix link (Vertex, s(Vertex))
23: end while
Algorithm 2 LeftSearch(k, L, R)

 1: CurrVertex ← root
 2: LB ← L
 3: RB ← R
 4: i ← k
 5: while i ≥ 1 and [LB, RB] is not empty do
 6:     if P[i] matches an outgoing label then
 7:         move down by P[i]
 8:         update CurrVertex
 9:         update LB, RB using formulas (1),(2)
10:         i ← i − 1
11:     else
12:         return no occurrences
13:     end if
14: end while
15: if [LB, RB] is not empty then
16:     if CurrVertex is implicit then
17:         CurrVertex ← closest explicit descendant of CurrVertex
18:     end if
19:     Traverse(CurrVertex, LB, RB)
20: end if



Algorithm 3 Traverse(Vertex, LB, RB)

 1: for all sons u of Vertex do
 2:     compute LB(u), RB(u) using formulas (1),(2)
 3:     if [LB(u), RB(u)] is not empty and u is not a leaf then
 4:         Traverse(u, LB(u), RB(u))
 5:     else if u is a leaf then
 6:         for all i ∈ [LB(u), RB(u)] do
 7:             output $Ord_u[i]$
 8:         end for
 9:     end if
10: end for



B Construction of $ST_r$



In this section we describe the construction of the suffix tree ST_r, as defined in
Section 2, for a given text string T.

Algorithms to construct the sparse suffix tree in time O(n) and space O(n/r)
have been proposed in [5] (see also PQ). However, the definition of sparse suffix 
tree from [5] differs from ours in the definition of suffix links. Specifically, 
according to [5], a suffix link from an explicit node representing a string α
points to a node representing α[r + 1..]. We call such suffix links r-suffix links.
The definition is well-founded, as implied by the following lemma: 

Lemma 6. If a string α, |α| > r, is represented in ST_r, then the string α[r + 1..]
is represented in ST_r too. Moreover, if α is represented by an explicit node, then
the same holds for α[r + 1..].

Assume that the sparse suffix tree together with r-suffix links has already
been constructed by the algorithm of [5]. To obtain ST_r, we only have to set the
suffix links as defined in Section 2. We will set suffix links consecutively
for types 1, 2, ..., r.

For each explicit node v of ST_r, we fix one of the occurrences of the
represented string l(v) in T starting at a block boundary. We then compile an
array Q of n/r lists of nodes of ST_r. A node v belongs to the i-th list iff the fixed
occurrence of l(v) starts at position ir + 1 of T. We assume that the nodes in each
list of Q occur in increasing order of string depth. Q can be compiled by
one breadth-first traversal of ST_r in O(n/r) time.
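The compilation of Q can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the `Node` class, its fields and `compile_q` are hypothetical names, blocks are numbered from 0, and the root is left out of the lists. As we read it, all nodes placed in the same list represent prefixes of the same suffix of T, so they lie on one root-to-leaf path, and a breadth-first visit therefore appends them in increasing order of string depth.

```python
from collections import deque


class Node:
    """Hypothetical explicit node of the sparse suffix tree."""

    def __init__(self, depth, block, children=()):
        self.depth = depth              # string depth d(v)
        self.block = block              # fixed occurrence starts at T[block * r + 1]
        self.children = list(children)


def compile_q(root, num_blocks):
    """Bucket the explicit nodes by the block index of their fixed occurrence."""
    q_lists = [[] for _ in range(num_blocks)]
    queue = deque(root.children)        # assumption: the root itself is skipped
    while queue:
        v = queue.popleft()
        q_lists[v.block].append(v)      # breadth-first order keeps each list sorted by depth
        queue.extend(v.children)
    return q_lists
```

Each node is enqueued and appended exactly once, so the running time is proportional to the number of explicit nodes.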

Consider some i, 0 < i ≤ r − 1. Let β_j^i, 0 ≤ j ≤ n/r − 1, be the longest prefix of
T[rj + i + 1..] represented in ST_r. At the first step of the construction, the
algorithm locates the (possibly implicit) nodes v_0, v_1, ..., v_{n/r−1} of ST_r
representing β_0^i, β_1^i, ..., β_{n/r−1}^i respectively. These nodes are used to
build suffix links of type i.

The following lemma can be proved: 

Lemma 7. The nodes v_0, v_1, ..., v_{n/r−1} can be located in time O(n/r).

We leave the details of the proof for an extended version of the paper. 
The second step is to build suffix links of type i using the nodes
v_0, v_1, ..., v_{n/r−1}.

Lemma 8. Let u and v be two explicit nodes such that u is an ancestor of v
(that is, l(u) is a prefix of l(v)). Then the type of the suffix link of u is not
larger than the type of the suffix link of v.

Lemma 8 ensures that all nodes with suffix links of type i occur consecutively
in the initial part of the lists Q[j] (note that, by induction, the nodes with
suffix links of type smaller than i have already been deleted from the lists Q[j],
see below), and that if the head element of some Q[j] does not have a suffix link
of type i, then no other element of Q[j] has one. Note also that a suffix link of
type i of some node v in Q[j] must point to a node on the path from the root to
v_j. Hence, the main idea is to maintain a stack of nodes on the path from the
root of ST_r to v_j in order to compute suffix links of type i for the nodes of
Q[j]. Note that the v_j's are implicit nodes in general, so some additional care
is needed for this procedure.

In more detail, we traverse ST_r depth-first and maintain a stack V
(implemented as an array, i.e. allowing access to all stored elements) of size
O(n/r) storing the explicit nodes on the path from the root to the current node
of ST_r. Assume that we are in a node v_j representing β_j^i, 0 ≤ j ≤ n/r − 1. We
check the head element v of the list Q[j]. If the string depth d(v) is less than
d(v_j), then the type of the suffix link from v is i. We find the first node u on
the path from the root of ST_r to v_j with string depth bigger than d(v) by a
binary search on the elements of V. The target node s(v) is then the (possibly
implicit) node (u, d(u) − d(v)). After computing s(v), v is deleted from Q[j]. We
repeat this procedure while the string depth of the head element is less than
d(v_j), and then continue the tree traversal.
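The binary search on the stack V can be sketched as below. This is a hedged illustration only: `locate_on_path` and its parameters are hypothetical names, the stack is assumed to be stored as two parallel arrays (nodes and their strictly increasing string depths), and the case where no stacked node is deep enough (the target lies on the edge below the last explicit node, which is where the "additional care" for implicit v_j comes in) is merely signalled by returning `None`.

```python
from bisect import bisect_right


def locate_on_path(path_nodes, path_depths, d):
    """Find the first stacked node u with string depth > d.

    Returns the pair (u, d(u) - d) used in the text to name the (possibly
    implicit) target node, or None if every stacked node has depth <= d.
    """
    idx = bisect_right(path_depths, d)   # first index whose depth exceeds d
    if idx == len(path_nodes):
        return None
    return path_nodes[idx], path_depths[idx] - d
```

For example, with stacked depths `[0, 3, 7, 12]` and `d = 4`, the search selects the node of depth 7 and reports offset 3. Each lookup costs time logarithmic in the stack size.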

Let us now turn to the time and space analysis. We need O(n) time and
O(n/r) space for the construction of ST_r and SA_r for a string T of length n. To
compute r-suffix links we need O(n/r) time. To locate the nodes v_0, v_1, ...,
v_{n/r−1} for a fixed i, we need O(n/r) time, and therefore O(n) time for all i,
0 < i < r. To compute all suffix links, we need O((n/r) · log(n/r) + n) = O(n) time.
Finally, to store V and Q during the tree traversals we need O(n/r) space.



C Construction of CT_r

In this section, we describe a construction algorithm for CT_r (Section 4). First,
note that the trie CT_r for a string T, without the additional arrays that we need,
can be constructed straightforwardly in time O(n) and space O(n/r). Assume now
that the trie has been constructed. We show how to augment it with the arrays c_v,
p_v and Ord_v.

First, we compute the string depth for all nodes of the trie, which can be
done in O(n/r) time by a depth-first traversal. The algorithm then proceeds by
depth levels, computing the auxiliary arrays for all nodes of depth 1, 2, etc.
Note that the arrays Ord_v are also stored explicitly for each level during the
construction procedure, but are erased after processing the level (except for the
leaves) to save space. For each node of the current level, we store Ord_v in
lexicographical order and the arrays p_v and c_v (c_v is computed right after
p_v by one pass through p_v in O(N(v)) time).

It is enough to show that if we have computed the arrays c_v, p_v and Ord_v for a
node v, then we can compute these arrays for each of its children u. Consider
Ord_v = ⟨j_1, j_2, ..., j_{N(v)}⟩. By definition, a leaf labeled with T_{j_k} is in
the subtree of CT_r rooted at the child u such that the label of the edge (v, u)
starts with the letter p_v[k]. Therefore, we read p_v and copy j_k to Ord_u, where
the first letter of the label on the edge from v to u is p_v[k] (note that u is
unique). After that, we write the letter T_{j_k}[d(u) + 1] into p_u.
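The distribution of Ord_v among the children of v can be sketched as follows. This is a minimal illustration under assumptions that go beyond the text: `push_down` is a hypothetical name, `children` maps the first letter of each outgoing edge label to a pair (child identifier, string depth d(u)), `labels[j]` stands for the string T_j as a 0-indexed Python string (so T_j[d(u) + 1] becomes `labels[j][d_u]`), and no letter is written when u is already a leaf.

```python
def push_down(ord_v, p_v, children, labels):
    """Compute Ord_u and p_u for every child u of v from Ord_v and p_v."""
    ord_u = {u: [] for u, _ in children.values()}
    p_u = {u: [] for u, _ in children.values()}
    for j, letter in zip(ord_v, p_v):
        u, d_u = children[letter]            # the unique child whose edge starts with p_v[k]
        ord_u[u].append(j)                   # stable copy: the order of Ord_v is preserved
        if d_u < len(labels[j]):             # T_j continues below u
            p_u[u].append(labels[j][d_u])    # letter T_j[d(u) + 1] in 1-indexed notation
    return ord_u, p_u
```

Because the entries of Ord_v are scanned once in order and appended, each Ord_u inherits the ordering of Ord_v, and the work per node is proportional to N(v).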

To finish, we delete Ord_v and compute the array c_u. All in all, we spend O(n/r)
time for computing the arrays of each next level. Since there are no more than r
levels, we need O(n) time for computing the additional arrays for CT_r. Note that
the arrays Ord_v for the leaves will be built automatically. Construction of the
array C in time O(n) is trivial.